This is a way to perform semi-supervised learning with an unreasonably complicated sequence of processing stages. First, we scale the data. Then, we transform it into a sparse binary representation with a totally random tree embedding. Finally, we use a restricted Boltzmann machine to extract features from that representation.

Frankly, I'll be amazed if it works.
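As a rough sketch, the same three unsupervised stages can be chained with scikit-learn's `Pipeline`. The data and parameters here are purely illustrative (tiny synthetic input, small component counts), not the settings used below:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.neural_network import BernoulliRBM

# synthetic stand-in for the real feature matrix
rng = np.random.RandomState(0)
X = rng.randn(50, 10)

# scale -> sparse binary tree embedding -> RBM hidden activations
stack = Pipeline([
    ('scale', StandardScaler()),
    ('embed', RandomTreesEmbedding(n_estimators=10, random_state=0)),
    ('rbm', BernoulliRBM(n_components=16, n_iter=5, random_state=0)),
])
X_new = stack.fit_transform(X)
print(X_new.shape)  # (50, 16): one column per RBM hidden unit
```

The tree embedding produces a high-dimensional sparse binary matrix, which is exactly the kind of input a Bernoulli RBM expects, so the stages compose cleanly.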


In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = 8, 12
plt.rcParams['axes.grid'] = True
plt.set_cmap('brg')


<matplotlib.figure.Figure at 0x7fc404861588>

In [2]:
cd ..


/home/gavin/repositories/hail-seizure

In [3]:
from python import utils

In [7]:
with open("settings/forestselection_gavin.json") as fh:
    settings = utils.json.load(fh)

In [8]:
with open("segmentMetadata.json") as fh:
    meta = utils.json.load(fh)

In [9]:
data = utils.get_data(settings)

In [10]:
da = utils.DataAssembler(settings, data, meta)

Then we just need to build training sets for each subject and apply the relevant models. Unfortunately, the cross-validator doesn't handle test segments, so we won't be able to run any informative cross-validation.
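One coarse workaround would be to k-fold the training segments on their own for a rough AUC sanity check. A minimal sketch on synthetic data (the matrix, labels, and fold count here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# synthetic stand-ins for one subject's training data
rng = np.random.RandomState(0)
Xtrain = rng.randn(60, 8)
ytrain = rng.randint(0, 2, 60)

# stratified folds keep the preictal/interictal ratio in each split
cv = StratifiedKFold(n_splits=3)
scores = cross_val_score(SVC(kernel='linear', probability=True),
                         Xtrain, ytrain, cv=cv, scoring='roc_auc')
print(scores)
```

On random labels like these the AUC should hover around 0.5; the point is only the mechanics of the split, not the score. Note this ignores the segment structure, so it would be optimistic on the real data.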


In [13]:
import sklearn.ensemble
import sklearn.preprocessing
import sklearn.neural_network
import sklearn.svm

In [38]:
scaler = sklearn.preprocessing.StandardScaler()
embedding = sklearn.ensemble.RandomTreesEmbedding(n_estimators=1000)
rbm = sklearn.neural_network.BernoulliRBM(n_components=500)
classifier = sklearn.svm.SVC(kernel='linear', probability=True)

In [39]:
%%time
predictions = {}
for subject in settings['SUBJECTS']:
    print("Processing " + subject)
    Xtrain,ytrain = da.build_training(subject)
    Xtest = da.build_test(subject)
    
    X = np.vstack([Xtrain,Xtest])
    
    print("Applying scaling.")
    # fit the preprocessing on train and test together (transductive)
    X = scaler.fit_transform(X)
    print("Shape of data: {0}".format(X.shape))
    
    print("Applying Tree embedding.")
    X = embedding.fit_transform(X)
    print("Shape of data: {0}".format(X.shape))
    
    print("Applying RBM transformation.")
    X = rbm.fit_transform(X)
    print("Shape of data: {0}".format(X.shape))
    
    # slice Xtrain and Xtest back off of X; the row counts are unchanged
    # by the transforms, so Xtrain.shape[0] is still the training-set size
    Xtrain = X[:Xtrain.shape[0],:]
    Xtest = X[Xtrain.shape[0]:,:]
    
    print("Fitting classifier.")
    # then fit the classifier
    classifier.fit(Xtrain,ytrain)
    
    print("Classifying test data.")
    # then classify the test set
    predictions[subject] = np.hstack([da.test_segments[np.newaxis].T,
                                      classifier.predict_proba(Xtest)[:, 1][np.newaxis].T])


Processing Dog_1
Applying scaling.
Shape of data: (1426, 3544)
Applying Tree embedding.
Shape of data: (1426, 22418)
Applying RBM transformation.
Shape of data: (1426, 500)
Fitting classifier.
Classifying test data.
Processing Dog_2
Applying scaling.
Shape of data: (1993, 3544)
Applying Tree embedding.
Shape of data: (1993, 22561)
Applying RBM transformation.
Shape of data: (1993, 500)
Fitting classifier.
Classifying test data.
Processing Dog_3
Applying scaling.
Shape of data: (3679, 3544)
Applying Tree embedding.
Shape of data: (3679, 22674)
Applying RBM transformation.
Shape of data: (3679, 500)
Fitting classifier.
Classifying test data.
Processing Dog_4
Applying scaling.
Shape of data: (2641, 3544)
Applying Tree embedding.
Shape of data: (2641, 22867)
Applying RBM transformation.
Shape of data: (2641, 500)
Fitting classifier.
Classifying test data.
Processing Dog_5
Applying scaling.
Shape of data: (1071, 3120)
Applying Tree embedding.
Shape of data: (1071, 20369)
Applying RBM transformation.
Shape of data: (1071, 500)
Fitting classifier.
Classifying test data.
Processing Patient_1
Applying scaling.
Shape of data: (319, 3120)
Applying Tree embedding.
Shape of data: (319, 16898)
Applying RBM transformation.
Shape of data: (319, 500)
Fitting classifier.
Classifying test data.
Processing Patient_2
Applying scaling.
Shape of data: (260, 7908)
Applying Tree embedding.
Shape of data: (260, 17632)
Applying RBM transformation.
Shape of data: (260, 500)
Fitting classifier.
Classifying test data.
CPU times: user 2h 59min 54s, sys: 33min 35s, total: 3h 33min 30s
Wall time: 3h 33min 56s

Writing this to a file for submission:


In [34]:
import csv

In [35]:
with open("output/svc_tree_embedded_rbm_transform_pg.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for subject in settings['SUBJECTS']:
        for line in predictions[subject]:
            c.writerow(line)

In [36]:
!wc output/svc_tree_embedded_rbm_transform_pg.csv


  3936   3936 193900 output/svc_tree_embedded_rbm_transform_pg.csv

In [37]:
!head output/svc_tree_embedded_rbm_transform_pg.csv